| | Provider | Model | Version | Estimate | Rank |
|---|---|---|---|---|---|
| 1 | anthropic | Claude 3.7 Sonnet | claude-3-7-sonnet-20250219 | 3.8580848 | top |
| 2 | anthropic | Claude 3.5 Sonnet | claude-3-5-sonnet-20241022 | 3.4210271 | top |
| 3 | xai | Grok 3 Beta | grok-3-beta | 3.0488472 | top |
| 4 | anthropic | Claude 3 Haiku | claude-3-haiku-20240307 | 0.3764656 | bottom |
| 5 | cohere | Command R | command-r-08-2024 | 0.3764656 | bottom |
| 6 | openai | GPT-3.5 Turbo | gpt-3.5-turbo | 0.3299676 | bottom |
| 7 | openai | GPT-4o Mini | gpt-4o-mini | 0.2865677 | bottom |
| 8 | | Gemini 2.5 Flash | gemini-2.5-flash | NA | new |
Building on our previous analysis, we selected models based on their performance. We chose 4 top models, whose deliberative reasoning was consistently more consistent than chance, and 4 bottom models, whose deliberative reasoning was consistently less consistent than chance. Gemini 2.5 Flash, listed as "new" with no estimate (NA), is counted here among the top models.
| | Case | Survey | N Participants |
|---|---|---|---|
| 1 | CCPS ACT Deliberative | ccps | 31 |
| 2 | CSIRO WA | energy_futures | 17 |
| 3 | Winterthur | zh_winterthur | 16 |
| | survey | considerations | policies | scale_max | q_method |
|---|---|---|---|---|---|
| 1 | ccps | 33 | 7 | 11 | FALSE |
| 2 | energy_futures | 45 | 9 | 11 | FALSE |
| 3 | zh_winterthur | 30 | 6 | 7 | FALSE |
| | uid | type | article | role | description |
|---|---|---|---|---|---|
| 1 | eco | ideology | an | ecologist | focuses on environmental protection and sustainability, advocating for societal change to ecological limits |
| 2 | coa | perspective | a | coastal resident | endures chronic flooding and salinization, forced to relocate due to rising sea levels and intense storms worsened by climate change |
| 3 | ctr | perspective | a | construction worker | suffers from extreme heat stress and lost work hours, perceiving climate change making outdoor labor unbearable and life-threatening |
| 4 | dis | perspective | a | disease survivor | recovers from dengue fever, aware that climate change’s rising temperatures are expanding the range of disease-carrying mosquitoes in their region |
| 5 | eld | perspective | an | elderly urban resident | endures intensified city heatwaves, struggling with disrupted services and feeling the direct, severe impact of climate change |
| 6 | far | perspective | a | displaced family | loses their home due to unprecedented wildfires, experiencing displacement and recognizing climate change as the major driver of the devastation |
| 7 | fis | perspective | a | fisher | notes his declining catches due to warming oceans, understanding that climate change is reorganizing marine life and reducing their traditional yield |
| 8 | lan | perspective | a | landowner | surveys his parched fields after a prolonged drought, feeling the compounding impacts of climate change that reduce crop yields and family income |
| 9 | par | perspective | a | parent | sees their child fall ill from a water-borne disease, attributing its spread to the increased heavy rainfall and warmer temperatures brought by climate change |
| 10 | sub | perspective | a | subsistence farmer | watches his crops wither under erratic rainfall patterns, and who sees these changes as direct consequence of climate change |
| 11 | vil | perspective | a | villager | faces dwindling, contaminated water supplies due to extended draughts and floods, aware that climate change is altering their water security |
| 12 | csk | devils | a | climate skeptic | prioritizes economic growth over CO2 emission cuts, fossil fuels over renewable energy, and does not believe in climate science |
We collected 1440 responses generated by 8 models across the 3 surveys and 12 roles described above, prompting each LLM 5 times with the same prompt (8 × 3 × 12 × 5 = 1440).
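The response count follows from a full factorial design. The Python sketch below is illustrative only (the report's own analysis appears to be done in R); it enumerates the design using the identifiers from the tables above. The variable names are hypothetical.

```python
from itertools import product

# Identifiers copied from the model, survey, and role tables above.
MODELS = [
    "claude-3-7-sonnet-20250219", "claude-3-5-sonnet-20241022", "grok-3-beta",
    "claude-3-haiku-20240307", "command-r-08-2024", "gpt-3.5-turbo",
    "gpt-4o-mini", "gemini-2.5-flash",
]
SURVEYS = ["ccps", "energy_futures", "zh_winterthur"]
ROLES = ["eco", "coa", "ctr", "dis", "eld", "far",
         "fis", "lan", "par", "sub", "vil", "csk"]
REPETITIONS = 5  # each prompt is sent 5 times

# One entry per individual LLM response.
requests = list(product(MODELS, SURVEYS, ROLES, range(REPETITIONS)))

# One entry per model/survey/role combination.
cells = {(model, survey, role) for model, survey, role, _ in requests}

print(len(requests))  # 1440
print(len(cells))     # 288
```

The 288 model/survey/role combinations match the number of observations reported in the mixed-model output further down.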
We instructed LLMs to play each of the roles described above by including a system instruction in each request following the pattern:
Answer the following prompts as [article] [role], who [description].
For example:
Answer the following prompts as a climate skeptic, who prioritizes economic growth over CO2 emission cuts, fossil fuels over renewable energy, and does not believe in climate science.
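As a minimal sketch of how the pattern can be filled from one row of the role table, the hypothetical helper `system_instruction` below substitutes the `article`, `role`, and `description` columns into the template. How the request is actually sent to each provider is not described in the source and is left out.

```python
def system_instruction(article: str, role: str, description: str) -> str:
    """Fill the system-instruction template with one row of the role table."""
    return f"Answer the following prompts as {article} {role}, who {description}."


# Row 12 of the role table (uid "csk").
csk = {
    "article": "a",
    "role": "climate skeptic",
    "description": (
        "prioritizes economic growth over CO2 emission cuts, fossil fuels "
        "over renewable energy, and does not believe in climate science"
    ),
}

prompt = system_instruction(**csk)
print(prompt)
```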
We calculated one DRI value per model/survey/role by treating each LLM response as one participant in a deliberation. The role “all” indicates that all roles were part of that deliberation (n = 60 participants, which equals 5 participants for each of the 12 roles). DRI plots are shown in Figure 7.3.
| model | survey | obs_mean | N | mu | p_value_two.sided | sig_two.sided | p_value_greater | sig_greater |
|---|---|---|---|---|---|---|---|---|
| Claude 3.5 Sonnet | ccps | 0.3759073 | 12 | 0 | 0.0009766 | * | 0.0004883 | * |
| Claude 3.5 Sonnet | energy_futures | 0.4695921 | 12 | 0 | 0.0009766 | * | 0.0004883 | * |
| Claude 3.5 Sonnet | zh_winterthur | 0.5683774 | 12 | 0 | 0.0004883 | * | 0.0002441 | * |
| Claude 3.7 Sonnet | ccps | 0.6819898 | 12 | 0 | 0.0004883 | * | 0.0002441 | * |
| Claude 3.7 Sonnet | energy_futures | 0.6173198 | 12 | 0 | 0.0004883 | * | 0.0002441 | * |
| Claude 3.7 Sonnet | zh_winterthur | 0.5911667 | 12 | 0 | 0.0004883 | * | 0.0002441 | * |
| Grok 3 Beta | ccps | 0.3605863 | 12 | 0 | 0.0004883 | * | 0.0002441 | * |
| Grok 3 Beta | energy_futures | 0.7103851 | 12 | 0 | 0.0004883 | * | 0.0002441 | * |
| Grok 3 Beta | zh_winterthur | 0.7314191 | 12 | 0 | 0.0004883 | * | 0.0002441 | * |
| Gemini 2.5 Flash | ccps | 0.8336696 | 12 | 0 | 0.0004883 | * | 0.0002441 | * |
| Gemini 2.5 Flash | energy_futures | 0.5166190 | 12 | 0 | 0.0009766 | * | 0.0004883 | * |
| Gemini 2.5 Flash | zh_winterthur | 0.6778375 | 12 | 0 | 0.0004883 | * | 0.0002441 | * |
| GPT-4o Mini | ccps | 0.0427425 | 12 | 0 | 0.6772461 | n.s. | 0.3386230 | n.s. |
| GPT-4o Mini | energy_futures | -0.0899976 | 12 | 0 | 0.5693359 | n.s. | 0.7407227 | n.s. |
| GPT-4o Mini | zh_winterthur | -0.2190937 | 12 | 0 | 0.0771484 | n.s. | 0.9680176 | n.s. |
| GPT-3.5 Turbo | ccps | -0.2532340 | 12 | 0 | 0.0161133 | * | 0.9938965 | n.s. |
| GPT-3.5 Turbo | energy_futures | -0.2836284 | 12 | 0 | 0.0122070 | * | 0.9953613 | n.s. |
| GPT-3.5 Turbo | zh_winterthur | -0.4205772 | 12 | 0 | 0.0034180 | * | 0.9987793 | n.s. |
| Command R | ccps | -0.4709172 | 12 | 0 | 0.0004883 | * | 1.0000000 | n.s. |
| Command R | energy_futures | -0.0245292 | 12 | 0 | 0.7910156 | n.s. | 0.6333008 | n.s. |
| Command R | zh_winterthur | -0.9582444 | 12 | 0 | 0.0004883 | * | 1.0000000 | n.s. |
| Claude 3 Haiku | ccps | -0.3105968 | 12 | 0 | 0.0004883 | * | 1.0000000 | n.s. |
| Claude 3 Haiku | energy_futures | -0.3584220 | 12 | 0 | 0.0009766 | * | 0.9997559 | n.s. |
| Claude 3 Haiku | zh_winterthur | -0.6380549 | 12 | 0 | 0.0004883 | * | 1.0000000 | n.s. |
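The table does not name the test that produced these p-values. They are all multiples of 1/4096 = 2^-12, which is what an exact Wilcoxon signed-rank test against mu = 0 with N = 12 would give, so the sketch below assumes that test. It is an illustrative Python reconstruction of the null distribution, not the authors' code, and the function names are hypothetical.

```python
def signed_rank_counts(n: int) -> list[int]:
    """counts[w] = number of sign assignments whose positive-rank sum is w."""
    max_w = n * (n + 1) // 2
    counts = [0] * (max_w + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        # Iterate downwards so each rank is used at most once.
        for w in range(max_w, rank - 1, -1):
            counts[w] += counts[w - rank]
    return counts


def p_greater(w_plus: int, n: int) -> float:
    """Exact one-sided p-value P(W+ >= w_plus) under the null hypothesis."""
    counts = signed_rank_counts(n)
    return sum(counts[w_plus:]) / 2 ** n


def p_two_sided(w_plus: int, n: int) -> float:
    """Exact two-sided p-value, using the symmetry of the null distribution."""
    max_w = n * (n + 1) // 2
    return min(1.0, 2 * p_greater(max(w_plus, max_w - w_plus), n))


# All 12 role-level values positive: W+ takes its maximum, 78.
print(round(p_greater(78, 12), 7), round(p_two_sided(78, 12), 7))
# 11 positive and the smallest-ranked value negative: W+ = 77.
print(round(p_greater(77, 12), 7), round(p_two_sided(77, 12), 7))
```

Under this assumption, 0.0002441 and 0.0004883 are the smallest one-sided and two-sided p-values attainable with 12 observations, so the many identical entries in the table reflect a floor rather than identical evidence.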
## Linear mixed model fit by REML ['lmerMod']
## Formula: dri ~ model + (1 | role) + (1 | survey)
## Data: df
##
## REML criterion at convergence: 127.1
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.86198 -0.63430 0.03286 0.59691 3.03838
##
## Random effects:
## Groups Name Variance Std.Dev.
## role (Intercept) 0.002483 0.04983
## survey (Intercept) 0.005538 0.07442
## Residual 0.080233 0.28326
## Number of obs: 288, groups: role, 12; survey, 3
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -0.43569 0.06544 -6.658
## modelClaude 3.5 Sonnet 0.90698 0.06676 13.585
## modelClaude 3.7 Sonnet 1.06585 0.06676 15.964
## modelCommand R -0.04887 0.06676 -0.732
## modelGemini 2.5 Flash 1.11173 0.06676 16.652
## modelGPT-3.5 Turbo 0.11654 0.06676 1.746
## modelGPT-4o Mini 0.34691 0.06676 5.196
## modelGrok 3 Beta 1.03649 0.06676 15.525
##
## Correlation of Fixed Effects:
## (Intr) mC3.5S mC3.7S mdlCmR mG2.5F mGPT-T mGPT-M
## mdlCld3.5Sn -0.510
## mdlCld3.7Sn -0.510 0.500
## modelCmmndR -0.510 0.500 0.500
## mdlGmn2.5Fl -0.510 0.500 0.500 0.500
## mdlGPT-3.5T -0.510 0.500 0.500 0.500 0.500
## modlGPT-4Mn -0.510 0.500 0.500 0.500 0.500 0.500
## modelGrk3Bt -0.510 0.500 0.500 0.500 0.500 0.500 0.500
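One way to read the random-effects block above is as variance shares. The Python sketch below is illustrative arithmetic on the printed variance components, not part of the original analysis. The shares are conditional on the fixed effect of model, so they describe only the variation left after model differences are accounted for.

```python
# Variance components copied from the "Random effects" block above.
variances = {
    "role": 0.002483,
    "survey": 0.005538,
    "residual": 0.080233,
}

total = sum(variances.values())
shares = {name: value / total for name, value in variances.items()}

for name, share in shares.items():
    print(f"{name:>8}: {share:.1%}")
```

On these numbers, role accounts for roughly 3% and survey for roughly 6% of the remaining variance, with about 91% left in the residual.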
## Linear mixed model fit by REML ['lmerMod']
## Formula: dri ~ model + (1 | survey/role)
## Data: df
##
## REML criterion at convergence: 128.9
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.95607 -0.66969 -0.00041 0.65619 3.06045
##
## Random effects:
## Groups Name Variance Std.Dev.
## role:survey (Intercept) 0.0009013 0.03002
## survey (Intercept) 0.0054477 0.07381
## Residual 0.0817355 0.28589
## Number of obs: 288, groups: role:survey, 36; survey, 3
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -0.43569 0.06412 -6.795
## modelClaude 3.5 Sonnet 0.90698 0.06739 13.460
## modelClaude 3.7 Sonnet 1.06585 0.06739 15.817
## modelCommand R -0.04887 0.06739 -0.725
## modelGemini 2.5 Flash 1.11173 0.06739 16.498
## modelGPT-3.5 Turbo 0.11654 0.06739 1.730
## modelGPT-4o Mini 0.34691 0.06739 5.148
## modelGrok 3 Beta 1.03649 0.06739 15.381
##
## Correlation of Fixed Effects:
## (Intr) mC3.5S mC3.7S mdlCmR mG2.5F mGPT-T mGPT-M
## mdlCld3.5Sn -0.525
## mdlCld3.7Sn -0.525 0.500
## modelCmmndR -0.525 0.500 0.500
## mdlGmn2.5Fl -0.525 0.500 0.500 0.500
## mdlGPT-3.5T -0.525 0.500 0.500 0.500 0.500
## modlGPT-4Mn -0.525 0.500 0.500 0.500 0.500 0.500
## modelGrk3Bt -0.525 0.500 0.500 0.500 0.500 0.500 0.500
## boundary (singular) fit: see help('isSingular')
## refitting model(s) with ML (instead of REML)
## Data: df
## Models:
## m0: dri ~ 1 + (1 | survey/role)
## m1: dri ~ model + (1 | survey/role)
## npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)
## m0 4 490.84 505.49 -241.420 482.84
## m1 11 118.59 158.89 -48.297 96.59 386.25 7 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
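The columns of the comparison table above can be reproduced from the log-likelihoods and parameter counts alone. The Python sketch below is an illustrative check using the standard definitions of AIC, BIC, and the likelihood-ratio statistic; the sample size of 288 is taken from the model summaries, and the dictionary layout is hypothetical.

```python
import math

N_OBS = 288  # "Number of obs" in the model summaries above

# npar and logLik copied from the comparison table.
fits = {
    "m0": {"npar": 4, "loglik": -241.420},
    "m1": {"npar": 11, "loglik": -48.297},
}


def aic(fit: dict) -> float:
    return -2 * fit["loglik"] + 2 * fit["npar"]


def bic(fit: dict) -> float:
    return -2 * fit["loglik"] + fit["npar"] * math.log(N_OBS)


# Likelihood-ratio statistic and its degrees of freedom.
chisq = 2 * (fits["m1"]["loglik"] - fits["m0"]["loglik"])
df = fits["m1"]["npar"] - fits["m0"]["npar"]

for name, fit in fits.items():
    print(name, round(aic(fit), 2), round(bic(fit), 2))
print(round(chisq, 2), df)
```

The seven extra parameters in m1 are the seven model contrasts, so the test asks whether DRI differs between models at all.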
## model emmean SE df lower.CL upper.CL
## Gemini 2.5 Flash 0.6760 0.0641 7.44 0.526 0.8259
## Claude 3.7 Sonnet 0.6302 0.0641 7.44 0.480 0.7800
## Grok 3 Beta 0.6008 0.0641 7.44 0.451 0.7506
## Claude 3.5 Sonnet 0.4713 0.0641 7.44 0.321 0.6211
## GPT-4o Mini -0.0888 0.0641 7.44 -0.239 0.0611
## GPT-3.5 Turbo -0.3191 0.0641 7.44 -0.469 -0.1693
## Claude 3 Haiku -0.4357 0.0641 7.44 -0.586 -0.2859
## Command R -0.4846 0.0641 7.44 -0.634 -0.3347
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
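The estimated marginal means above can be recovered from the fixed effects of the mixed model. Claude 3 Haiku has no coefficient of its own in the output, so it appears to be the reference level, and each model's mean is the intercept plus that model's coefficient. The Python sketch below is illustrative arithmetic under that reading, not the authors' code.

```python
# Fixed effects copied from the mixed-model summary above.
INTERCEPT = -0.43569  # reference level, apparently Claude 3 Haiku
coefficients = {
    "Claude 3.5 Sonnet": 0.90698,
    "Claude 3.7 Sonnet": 1.06585,
    "Command R": -0.04887,
    "Gemini 2.5 Flash": 1.11173,
    "GPT-3.5 Turbo": 0.11654,
    "GPT-4o Mini": 0.34691,
    "Grok 3 Beta": 1.03649,
}

# Marginal mean per model: intercept for the reference, intercept + coefficient otherwise.
means = {"Claude 3 Haiku": INTERCEPT}
means.update({name: INTERCEPT + coef for name, coef in coefficients.items()})

for name, mean in sorted(means.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<18} {mean:7.4f}")
```

Sorting by the computed mean gives the same ordering as the table, from Gemini 2.5 Flash down to Command R.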
## # A tibble: 12 × 3
## role mean_dri sd_dri
## <chr> <dbl> <dbl>
## 1 coa 0.125 0.547
## 2 csk 0.287 0.550
## 3 ctr 0.189 0.457
## 4 dis 0.0416 0.564
## 5 eco 0.141 0.638
## 6 eld 0.149 0.531
## 7 far 0.0617 0.612
## 8 fis 0.0519 0.604
## 9 lan 0.170 0.506
## 10 par 0.111 0.608
## 11 sub 0.210 0.541
## 12 vil 0.0379 0.616
## # A tibble: 12 × 4
## role mean_role_noise max_role_noise min_role_noise
## <chr> <dbl> <dbl> <dbl>
## 1 coa 0.246 0.549 0.116
## 2 csk 0.187 0.370 0.00776
## 3 ctr 0.299 0.402 0.106
## 4 dis 0.217 0.369 0.00799
## 5 eco 0.233 0.517 0.0277
## 6 eld 0.245 0.724 0.0452
## 7 far 0.221 0.373 0.0647
## 8 fis 0.192 0.566 0.0365
## 9 lan 0.251 0.442 0.121
## 10 par 0.304 0.559 0.0512
## 11 sub 0.349 0.685 0.128
## 12 vil 0.301 0.571 0.0186
##
## Fligner-Killeen test of homogeneity of variances
##
## data: sd_rep by role
## Fligner-Killeen:med chi-squared = 8.0891, df = 11, p-value = 0.7053
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 7 1.8873 0.08108 .
## 88
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
| Question | Statement | Response Type |
|---|---|---|
| C1 | There is not enough information to definitively say that climate change is real. | Likert from 1 to 11 |
| C2 | The response to climate change is not going to be positive. The same mistakes will keep happening. | Likert from 1 to 11 |
| C3 | Climate variation is normal, so why should this be a problem? | Likert from 1 to 11 |
| C4 | More educational programmes are needed to increase public awareness about climate change. | Likert from 1 to 11 |
| C5 | Climate change will not be a problem because there will be technological solutions available. | Likert from 1 to 11 |
| C6 | I don’t trust what scientists say about climate change. | Likert from 1 to 11 |
| C7 | I don’t trust what I hear about climate change from government. | Likert from 1 to 11 |
| C8 | We need strong political leadership to do something about climate change. | Likert from 1 to 11 |
| C9 | I think it is safe to say climate change is here. | Likert from 1 to 11 |
| C10 | I’m not going to do anything to address climate change because it is not a major issue. | Likert from 1 to 11 |
| C11 | There’s not much point in me doing anything to fix this. No-one else is going to. | Likert from 1 to 11 |
| C12 | It’s difficult to trust what comes out in the media on the issue of climate change. | Likert from 1 to 11 |
| C13 | It is already too late to do anything, as any action to stop climate change will take a long time to take effect. | Likert from 1 to 11 |
| C14 | I’m not concerned enough to do anything drastic about this, such as participate in political action. | Likert from 1 to 11 |
| C15 | It is unfair that we are going to leave the climate in a mess for future generations. | Likert from 1 to 11 |
| C16 | We should pay for greenhouse emissions. | Likert from 1 to 11 |
| C17 | We can adapt to the coming changes. | Likert from 1 to 11 |
| C18 | It is clear that we are already entering the zone of dangerous climate change. | Likert from 1 to 11 |
| C19 | I care about the planet. | Likert from 1 to 11 |
| C20 | I don’t know what to do. I’m very concerned and would like to do something, but I don’t have a realistic shortlist of things that would really make a difference. | Likert from 1 to 11 |
| C21 | Australia does not owe it to the rest of the world to reduce emissions and suffer economically. | Likert from 1 to 11 |
| C22 | If Australia reduces greenhouse gases it won’t make a difference. That will just shift Australian jobs to other countries. | Likert from 1 to 11 |
| C23 | This is so depressing and is so out of our control. | Likert from 1 to 11 |
| C24 | I believe that the difference we can have as an individual, in Australia, is so minimal that our actions are worthless. | Likert from 1 to 11 |
| C25 | Australia is particularly vulnerable to climate change, and it is in our interest to help find an effective global solution. | Likert from 1 to 11 |
| C26 | We need laws addressing climate change because people are not going to volunteer to change. | Likert from 1 to 11 |
| C27 | I want to do something, but it is too big and too hard. | Likert from 1 to 11 |
| C28 | When I read in the paper that climate change is not true, I start to have doubts about whether it is changing. | Likert from 1 to 11 |
| C29 | Doing something to reduce emissions feels a bit hopeless but I just want to feel that I’m doing the most I can. | Likert from 1 to 11 |
| C30 | The fate of the planet is too important to be left to market forces. | Likert from 1 to 11 |
| C31 | Australia’s emissions are tiny, so it’s not up to us to act. | Likert from 1 to 11 |
| C32 | Governments should take a far greater role in preparing towns and cities to adapt to the impacts of climate change. | Likert from 1 to 11 |
| C33 | Failure to address climate change is the fault of political leaders. | Likert from 1 to 11 |
| P1 | Leave the policy settings as they are. | Ranked-choice from 1 to 7 |
| P2 | Policies that emphasise economic growth over climate change adaptation or mitigation. | Ranked-choice from 1 to 7 |
| P3 | Policies that involve a dramatic cut back in CO2 emissions (by 50% in the next 10 years). | Ranked-choice from 1 to 7 |
| P4 | Policies that involve a moderate cut in CO2 emissions (by 25% in over the next 10 years). | Ranked-choice from 1 to 7 |
| P5 | Adaptation policies and expenditure (e.g. coastal protection, water desalinisation, improving infrastructure etc). Planning controls and emergency response programs. | Ranked-choice from 1 to 7 |
| P6 | Adaptation policies that target individual, small business or community-based actions (eg support the installation of alternative energy generators, insulation, water use efficiency) | Ranked-choice from 1 to 7 |
| P7 | Preparing for climate risk through the development of new approaches and technologies that enhance resilience to the impacts of climate variability or change. | Ranked-choice from 1 to 7 |
We compared top with bottom models in terms of consistency of DRI and Cronbach’s Alpha (see top models in Figure 7.1 and bottom models in Figure 7.2).
Figure 7.1: Top models
We found that top LLMs are consistent across roles in terms of both DRI and Cronbach’s Alpha (policies). The high DRI across roles (median = 0.637; IQR = 0.161) suggests that top LLMs tend to consistently align their considerations and policy preferences. The high Cronbach’s alpha for their policy preferences (median = 0.784; IQR = 0.047) suggests that they tend to agree on the ranking of their policy preferences.
Figure 7.2: Bottom models
We also found that bottom LLMs are not consistent across roles in terms of DRI and are less consistent than top models in terms of Cronbach’s Alpha (policies). The low DRI across roles (median = -0.177; IQR = 0.163) suggests that bottom LLMs tend to consistently misalign their considerations and policy preferences. Their lower Cronbach’s alpha for policy preferences (median = 0.635; IQR = 0.11) suggests that they agree less on the ranking of their policy preferences than top models do.
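For readers unfamiliar with the statistic, the sketch below computes Cronbach's alpha from its standard definition, alpha = k/(k-1) × (1 − Σ item variances / variance of the row totals). It is an illustrative Python version, not the authors' implementation. The source does not say how responses were arranged, so the layout is an assumption: presumably policies are the rows and simulated participants are the columns (items), since each participant's ranks sum to the same constant and the opposite layout would give zero total variance.

```python
from statistics import variance


def cronbach_alpha(rows: list[list[float]]) -> float:
    """Cronbach's alpha for a table with one row per case and one column per item."""
    k = len(rows[0])
    # Sample variance of each item (column) across cases.
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of the per-case totals.
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Toy data (not from the study): 3 items that agree perfectly, then partially.
perfect = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
partial = [[1, 2], [2, 1], [3, 3]]

print(cronbach_alpha(perfect))
print(round(cronbach_alpha(partial), 3))
```

Items that agree perfectly give an alpha of 1, and disagreement between items pulls the value down, which is the sense in which a higher alpha indicates closer agreement on the policy ranking.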
| | role | claude-3-5-sonnet-20241022 | claude-3-7-sonnet-20250219 | claude-3-haiku-20240307 | command-r-08-2024 | gemini-2.5-flash | gpt-3.5-turbo | gpt-4o-mini | grok-3-beta | best_model |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | all | 0.512 | 0.639 | -0.291 | -0.281 | 0.638 | -0.213 | 0.000 | 0.625 | claude-3-7-sonnet-20250219 |
| 2 | coa | 0.350 | 0.565 | -0.526 | -0.435 | 0.810 | -0.315 | -0.019 | 0.567 | gemini-2.5-flash |
| 3 | csk | 0.543 | 0.773 | -0.118 | -0.580 | 0.875 | 0.163 | -0.153 | 0.795 | gemini-2.5-flash |
| 4 | ctr | 0.343 | 0.567 | -0.368 | -0.264 | 0.663 | -0.129 | 0.252 | 0.447 | gemini-2.5-flash |
| 5 | dis | 0.476 | 0.538 | -0.553 | -0.490 | 0.569 | -0.719 | 0.057 | 0.455 | gemini-2.5-flash |
| 6 | eco | 0.364 | 0.720 | -0.281 | -0.831 | 0.854 | -0.472 | 0.084 | 0.696 | gemini-2.5-flash |
| 7 | eld | 0.404 | 0.498 | -0.335 | -0.396 | 0.796 | -0.078 | -0.322 | 0.626 | gemini-2.5-flash |
| 8 | far | 0.479 | 0.651 | -0.524 | -0.673 | 0.821 | -0.388 | -0.370 | 0.497 | gemini-2.5-flash |
| 9 | fis | 0.497 | 0.593 | -0.492 | -0.560 | 0.685 | -0.665 | -0.244 | 0.602 | gemini-2.5-flash |
| 10 | lan | 0.595 | 0.633 | -0.318 | -0.347 | 0.477 | -0.466 | 0.199 | 0.587 | claude-3-7-sonnet-20250219 |
| 11 | par | 0.498 | 0.708 | -0.669 | -0.472 | 0.598 | -0.164 | -0.284 | 0.670 | claude-3-7-sonnet-20250219 |
| 12 | sub | 0.526 | 0.712 | -0.433 | -0.218 | 0.556 | -0.106 | -0.014 | 0.654 | claude-3-7-sonnet-20250219 |
| 13 | vil | 0.581 | 0.604 | -0.612 | -0.550 | 0.407 | -0.490 | -0.252 | 0.613 | grok-3-beta |
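In the table above, the `best_model` column appears to hold the model with the highest value in each row. The Python sketch below illustrates that reading using the `all` row; it is not the authors' code, and the tie-breaking rule (first model in column order) is an assumption.

```python
# Values copied from the "all" row of the table above, in column order.
row_all = {
    "claude-3-5-sonnet-20241022": 0.512,
    "claude-3-7-sonnet-20250219": 0.639,
    "claude-3-haiku-20240307": -0.291,
    "command-r-08-2024": -0.281,
    "gemini-2.5-flash": 0.638,
    "gpt-3.5-turbo": -0.213,
    "gpt-4o-mini": 0.000,
    "grok-3-beta": 0.625,
}

# max() returns the first key with the largest value, so ties go to the
# earlier column.
best_model = max(row_all, key=row_all.get)
print(best_model)  # claude-3-7-sonnet-20250219
```

The margin can be very small: in this row the best model leads the runner-up by 0.001, so the column is better read as a label than as evidence of a meaningful difference.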
| | role | claude-3-5-sonnet-20241022 | claude-3-7-sonnet-20250219 | claude-3-haiku-20240307 | command-r-08-2024 | gemini-2.5-flash | gpt-3.5-turbo | gpt-4o-mini | grok-3-beta | best_model |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | all | 0.725 | 0.792 | 0.614 | 0.638 | 0.801 | 0.599 | 0.641 | 0.818 | grok-3-beta |
| 2 | coa | 0.713 | 0.745 | 0.816 | 0.808 | 0.771 | 0.737 | 0.763 | 0.807 | claude-3-haiku-20240307 |
| 3 | csk | 0.783 | 0.802 | 0.813 | 0.708 | 0.848 | 0.764 | 0.715 | 0.851 | grok-3-beta |
| 4 | ctr | 0.749 | 0.791 | 0.774 | 0.776 | 0.918 | 0.787 | 0.727 | 0.755 | gemini-2.5-flash |
| 5 | dis | 0.761 | 0.772 | 0.669 | 0.802 | 0.771 | 0.762 | 0.756 | 0.796 | command-r-08-2024 |
| 6 | eco | 0.764 | 0.844 | 0.711 | 0.730 | 0.814 | 0.800 | 0.759 | 0.716 | claude-3-7-sonnet-20250219 |
| 7 | eld | 0.722 | 0.793 | 0.788 | 0.740 | 0.741 | 0.801 | 0.813 | 0.828 | grok-3-beta |
| 8 | far | 0.726 | 0.807 | 0.791 | 0.843 | 0.827 | 0.769 | 0.828 | 0.824 | command-r-08-2024 |
| 9 | fis | 0.787 | 0.792 | 0.690 | 0.793 | 0.829 | 0.750 | 0.825 | 0.704 | gemini-2.5-flash |
| 10 | lan | 0.715 | 0.792 | 0.802 | 0.805 | 0.789 | 0.783 | 0.795 | 0.792 | command-r-08-2024 |
| 11 | par | 0.785 | 0.704 | 0.774 | 0.777 | 0.790 | 0.778 | 0.762 | 0.833 | grok-3-beta |
| 12 | sub | 0.841 | 0.800 | 0.671 | 0.754 | 0.761 | 0.760 | 0.803 | 0.839 | claude-3-5-sonnet-20241022 |
| 13 | vil | 0.708 | 0.818 | 0.770 | 0.794 | 0.808 | 0.786 | 0.798 | 0.662 | claude-3-7-sonnet-20250219 |
| | role | claude-3-5-sonnet-20241022 | claude-3-7-sonnet-20250219 | claude-3-haiku-20240307 | command-r-08-2024 | gemini-2.5-flash | gpt-3.5-turbo | gpt-4o-mini | grok-3-beta | best_model |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | all | 0.990 | 0.990 | 0.976 | 0.975 | 0.984 | 0.911 | 0.976 | 0.987 | claude-3-5-sonnet-20241022 |
| 2 | coa | 0.863 | 0.918 | 0.880 | 0.787 | 0.849 | 0.886 | 0.837 | 0.891 | claude-3-7-sonnet-20250219 |
| 3 | csk | 0.769 | 0.856 | 0.898 | 0.767 | 0.551 | 0.952 | 0.817 | 0.831 | gpt-3.5-turbo |
| 4 | ctr | 0.916 | 0.909 | 0.872 | 0.915 | 0.852 | 0.916 | 0.852 | 0.906 | claude-3-5-sonnet-20241022 |
| 5 | dis | 0.905 | 0.921 | 0.894 | 0.904 | 0.859 | 0.918 | 0.876 | 0.896 | claude-3-7-sonnet-20250219 |
| 6 | eco | 0.900 | 0.860 | 0.884 | 0.827 | 0.842 | 0.865 | 0.871 | 0.863 | claude-3-5-sonnet-20241022 |
| 7 | eld | 0.917 | 0.899 | 0.919 | 0.886 | 0.917 | 0.911 | 0.879 | 0.903 | claude-3-haiku-20240307 |
| 8 | far | 0.905 | 0.848 | 0.919 | 0.747 | 0.815 | 0.774 | 0.860 | 0.905 | claude-3-haiku-20240307 |
| 9 | fis | 0.916 | 0.895 | 0.894 | 0.907 | 0.896 | 0.918 | 0.891 | 0.905 | gpt-3.5-turbo |
| 10 | lan | 0.917 | 0.914 | 0.884 | 0.904 | 0.884 | 0.885 | 0.909 | 0.917 | claude-3-5-sonnet-20241022 |
| 11 | par | 0.925 | 0.905 | 0.863 | 0.867 | 0.830 | 0.888 | 0.885 | 0.922 | claude-3-5-sonnet-20241022 |
| 12 | sub | 0.902 | 0.919 | 0.895 | 0.758 | 0.851 | 0.889 | 0.906 | 0.911 | claude-3-7-sonnet-20250219 |
| 13 | vil | 0.881 | 0.880 | 0.914 | 0.901 | 0.873 | 0.927 | 0.895 | 0.887 | gpt-3.5-turbo |
These plots show a simulated deliberation across all 12 roles for each survey and model. Each simulated deliberation has 60 participants (12 roles with 5 participants each).
Note that bottom models are visually inconsistent.
Figure 7.3: DRI Plots
These plots show a simulated deliberation across all models in the same class (i.e., top, bottom) for each role and survey. Each simulated deliberation has 20 participants (4 models with 5 participants each).
Note that top models are visually more consistent than bottom models.